{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# HyperParameterTuning - Fighting Breast Cancer\n",
    "\n",
    "This tutorial shows how SynapseML can be used to identify the best combination of hyperparameters for your chosen classifiers, ultimately resulting in more accurate and reliable models. In order to demonstrate this, we'll show how to perform distributed randomized grid search hyperparameter tuning to build a model to identify breast cancer. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1 - Set up dependencies\n",
    "Start by importing pandas and setting up our Spark session."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, read the data and split it into tuning and test sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = spark.read.parquet(\n",
    "    \"wasbs://publicwasb@mmlspark.blob.core.windows.net/BreastCancer.parquet\"\n",
    ").cache()\n",
    "tune, test = data.randomSplit([0.80, 0.20])\n",
    "tune.limit(10).toPandas()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the models to be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from synapse.ml.automl import TuneHyperparameters\n",
    "from synapse.ml.train import TrainClassifier\n",
    "from pyspark.ml.classification import (\n",
    "    LogisticRegression,\n",
    "    RandomForestClassifier,\n",
    "    GBTClassifier,\n",
    ")\n",
    "\n",
    "logReg = LogisticRegression()\n",
    "randForest = RandomForestClassifier()\n",
    "gbt = GBTClassifier()\n",
    "smlmodels = [logReg, randForest, gbt]\n",
    "mmlmodels = [TrainClassifier(model=model, labelCol=\"Label\") for model in smlmodels]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2 - Find the best model using AutoML\n",
    "\n",
    "Import SynapseML's AutoML classes from `synapse.ml.automl`.\n",
    "Specify the hyperparameters using the `HyperparamBuilder`. Add either `DiscreteHyperParam` or `RangeHyperParam` hyperparameters. `TuneHyperparameters` will randomly choose values from a uniform distribution:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from synapse.ml.automl import *\n",
    "\n",
    "paramBuilder = (\n",
    "    HyperparamBuilder()\n",
    "    .addHyperparam(logReg, logReg.regParam, RangeHyperParam(0.1, 0.3))\n",
    "    .addHyperparam(randForest, randForest.numTrees, DiscreteHyperParam([5, 10]))\n",
    "    .addHyperparam(randForest, randForest.maxDepth, DiscreteHyperParam([3, 5]))\n",
    "    .addHyperparam(gbt, gbt.maxBins, RangeHyperParam(8, 16))\n",
    "    .addHyperparam(gbt, gbt.maxDepth, DiscreteHyperParam([3, 5]))\n",
    ")\n",
    "searchSpace = paramBuilder.build()\n",
    "# The search space is a list of params to tuples of estimator and hyperparam\n",
    "print(searchSpace)\n",
    "randomSpace = RandomSpace(searchSpace)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, run TuneHyperparameters to get the best model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bestModel = TuneHyperparameters(\n",
    "    evaluationMetric=\"accuracy\",\n",
    "    models=mmlmodels,\n",
    "    numFolds=2,\n",
    "    numRuns=len(mmlmodels) * 2,\n",
    "    parallelism=1,\n",
    "    paramSpace=randomSpace.space(),\n",
    "    seed=0,\n",
    ").fit(tune)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3 - Evaluate the model\n",
    "We can view the best model's parameters and retrieve the underlying best model pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(bestModel.getBestModelInfo())\n",
    "print(bestModel.getBestModel())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can score against the test set and view metrics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from synapse.ml.train import ComputeModelStatistics\n",
    "\n",
    "prediction = bestModel.transform(test)\n",
    "metrics = ComputeModelStatistics().transform(prediction)\n",
    "metrics.limit(10).toPandas()"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}